Visual Multivariate Analysis - Starbucks Drinks

Take-Home Exercise 3

Zhang Jieyuan
2022-02-20

1.0 Introduction

The aim of this exercise is to create a data visualisation to segment kids drinks and others by nutrition indicators. For this task, starbucks_drink.csv is used.

We will look into how different drinks in the kids drinks and other segment are associated with one another using the following:

2.0 Data Preparation

For this exercise, the GGally, parcoords, parallelPlot, seriation, dendextend, heatmaply and tidyverse packages will be used. The below choke chunks loads the necessary packages.

packages = c('GGally','parcoords', 'parallelPlot', 'tidyverse','seriation', 'dendextend', 'heatmaply')

for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Load data and filter out the “kids drinks and other” segment. Add a new column to combine name with size, milk type and if there is whipped cream.

drinks<-read_csv("data/starbucks_drink.csv")

kid_others<-drinks %>%
  filter(Category=="kids-drinks-and-other")

kid_others$Name_with_Desc <-paste(kid_others$Name,kid_others$Size,kid_others$Milk,kid_others$`Whipped Cream`)

Reorder the data according to the sequence preferred.

kid_others2 <-kid_others[,c(2,19,4:18)]

row.names(kid_others2)<-kid_others2$Name_with_Desc

3.0 Charts

3.1 Parallel Coordinates with Histogram

Interactive parallel coordinates with histogram is plot with the following code with parallelPlot() and histoVisibility. We are able to see how the different nutrient indicators are related to one another for each drink. Additionally, we can also observe the distribution of drinks within different bins for each indicator using the histogram.

For example, by selecting drinks with calories above 600, we are able to observe that all three drinks are high in sodium, total carbohydrate and sugar, while two out of the three drinks are high in trans fat.

histoVisibility <- rep(TRUE, ncol(kid_others2[,2:14]))

parallelPlot(kid_others2[,2:14],
             rotateTitle = TRUE,
             histoVisibility = histoVisibility)

3.2 Parallel Coordinates by Name

By plotting parallel coordinates with facets using ggparcoord(), it enables us to dive deeper by splitting them into factors/categories. In this case, each subplot represent one drink, while the lines represents its variations in size, milk type and whipped cream. The y values are scaled in ggparcoord(), hence, the y axis is between 0-1.

ggparcoord(data = kid_others2, 
           columns = c(3:14), 
           groupColumn = 1,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of Nutrition factor by Name"
           ) +
  facet_wrap(~ Name)+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Further filter is done to see how nutrition indicators differ across the drinks with and without whipped cream.

Without Whipped Cream:

ggparcoord(data = (kid_others2 %>% filter(`Whipped Cream`=="No Whipped Cream")), 
           columns = c(3:14), 
           groupColumn = 1,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Nutrition factor by Name with No Whipped Cream"
           ) +
  facet_wrap(~ Name)+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

With Whipped Cream:

ggparcoord(data = (kid_others2 %>% filter(`Whipped Cream`=="Whipped Cream")), 
           columns = c(3:14), 
           groupColumn = 1,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Nutrition factor by Name with Whipped Cream"
           ) +
  facet_wrap(~ Name)+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

From the chart above, we observed that some drinks with whipped cream contains the trans fat.

3.3 Interactive Cluster Heatmap

Create data matrix before plotting the heatmap.

kid_others3 <-kid_others2[,3:14]
row.names(kid_others3)<-kid_others2$Name_with_Desc
kid_others_matrix <-data.matrix(kid_others3)

Plot the interactive heatmap with heatmaply().

heatmaply(kid_others_matrix,column_text_angle = 90, fontsize_row=6)

Normalized heatmap:

Using normalize function on the data to bring them to 0-1 scale by subtracting the minimum and dividing by the maximum of all observations.

While preserving the distribution of each variable, the different variables can now be compared after normalization.

heatmaply(normalize(kid_others_matrix),column_text_angle = 90,fontsize_row=6)

4.0 Key Challenge

The key challenge is that there are many drinks in the kids and other segment. Although heatmap and parallel coordinate plots both enable us to compare how the nutrition indicator varies across the drinks, further splitting of the plots or making use of the interactive function to zoom in to a particular area might be needed to generate insights.

5.0 References